17 research outputs found

    OVERLAPPED-SPEECH DETECTION WITH APPLICATIONS TO DRIVER ASSESSMENT FOR IN-VEHICLE ACTIVE SAFETY SYSTEMS

    ABSTRACT In this study, we propose a system for overlapped-speech detection. Spectral harmonicity and envelope features are extracted to represent overlapped and single-speaker speech using Gaussian mixture models (GMMs). The system is shown to effectively discriminate between the single-speaker and overlapped-speech classes. We further increase the discrimination by proposing a phoneme selection scheme to generate more reliable artificial overlapped data for model training. Evaluations on artificially generated co-channel data show that the novel feature selection and phoneme omission yield a relative improvement of 10% in detection accuracy over the baseline. As an example application, we evaluate the effectiveness of overlapped-speech detection in vehicular environments and its potential for assessing driver alertness. Results indicate a good correlation between driver performance and the amount and location of overlapped-speech segments.
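    The two-class GMM framework this abstract describes can be sketched roughly as follows. This is a toy illustration, not the authors' system: the random vectors stand in for the paper's spectral-harmonicity and envelope features, and the model sizes are arbitrary.

    ```python
    # Sketch: train one GMM on single-speaker frames and one on overlapped
    # frames, then label each test frame by whichever model scores higher.
    # Features and component counts here are placeholder assumptions.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Toy stand-ins for per-frame feature vectors (e.g., harmonicity + envelope).
    single_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
    overlap_feats = rng.normal(loc=2.0, scale=1.0, size=(500, 4))

    gmm_single = GaussianMixture(n_components=4, random_state=0).fit(single_feats)
    gmm_overlap = GaussianMixture(n_components=4, random_state=0).fit(overlap_feats)

    def detect_overlap(frames):
        """Return True for frames the overlap GMM scores higher than the single-speaker GMM."""
        ll_single = gmm_single.score_samples(frames)
        ll_overlap = gmm_overlap.score_samples(frames)
        return ll_overlap > ll_single

    test_frames = rng.normal(loc=2.0, scale=1.0, size=(10, 4))
    print(detect_overlap(test_frames).mean())  # fraction of frames flagged as overlapped
    ```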

    Analysis and detection of cognitive load and frustration in drivers' speech

    Non-driving-related cognitive load and variations in emotional state may impact a driver's capability to control a vehicle and introduce driving errors. The availability of reliable cognitive-load and emotion detection in drivers would benefit the design of active safety systems and other intelligent in-vehicle interfaces. In this study, speech produced by 68 subjects while driving in urban areas is analyzed. A particular focus is on speech production differences in two secondary cognitive tasks, interactions with a co-driver and calls to automated spoken dialog systems (SDS), and in two emotional states (neutral and negative) during the SDS interactions. A number of speech parameters are found to vary across the cognitive/emotion classes. The suitability of selected cepstral- and production-based features for automatic cognitive task/emotion classification is investigated. A fusion of GMM/SVM classifiers yields an accuracy of 94.3% in cognitive-task classification and 81.3% in emotion classification.
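    A GMM/SVM fusion of the kind mentioned above is commonly done at the score level. The sketch below is an assumption about the general technique, not the study's actual recipe: the features are synthetic, and the equal weighting of the two scores is arbitrary.

    ```python
    # Sketch of late (score-level) fusion: a GMM log-likelihood ratio and an
    # SVM decision score are combined with a weighted sum before thresholding.
    # Data, weighting, and class labels are illustrative assumptions.
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X_a = rng.normal(-1.0, 1.0, size=(200, 3))   # e.g., "co-driver interaction"
    X_b = rng.normal(+1.0, 1.0, size=(200, 3))   # e.g., "SDS call"
    X = np.vstack([X_a, X_b])
    y = np.array([0] * 200 + [1] * 200)

    gmm0 = GaussianMixture(n_components=2, random_state=0).fit(X[y == 0])
    gmm1 = GaussianMixture(n_components=2, random_state=0).fit(X[y == 1])
    svm = SVC(random_state=0).fit(X, y)

    def fused_predict(frames, w=0.5):
        """Weighted sum of GMM log-likelihood ratio and SVM margin; True -> class 1."""
        gmm_score = gmm1.score_samples(frames) - gmm0.score_samples(frames)
        svm_score = svm.decision_function(frames)
        return (w * gmm_score + (1 - w) * svm_score) > 0

    X_test = rng.normal(+1.0, 1.0, size=(20, 3))
    print(fused_predict(X_test).mean())  # fraction assigned to class 1
    ```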

    SPEAKER VERIFICATION BASED PROCESSING FOR ROBUST ASR IN CO-CHANNEL SPEECH SCENARIOS

    No full text
    ABSTRACT Co-channel speech, which occurs in monaural audio recordings of two or more overlapping talkers, poses a great challenge for automatic speech applications. Automatic speech recognition (ASR) performance, in particular, has been shown to degrade significantly in the presence of a competing talker. In this paper, assuming a known target-talker scenario, we present two different masking strategies based on speaker verification to alleviate the impact of competing-talker (a.k.a. masker) interference on ASR performance. In the first approach, frame-level speaker verification likelihoods are used as reliability measures that control the degree to which each frame contributes to the Viterbi search, while in the second approach time-frequency (T-F) level speaker verification scores form soft masks for speech separation. The effectiveness of the two strategies, both individually and in combination, is evaluated in the context of ASR tasks with speech mixtures at various signal-to-interference ratios (SIR), ranging from 6 dB to -9 dB. Experimental results indicate the efficacy of the proposed speaker-verification-based solutions in mitigating the impact of competing-talker interference on ASR performance. A combination of the two masking techniques yields reductions as large as 43% in word error rate.
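    The first strategy, frame-level reliability weighting inside a Viterbi search, can be illustrated on a toy HMM. This is only a sketch under assumptions: the weighting-by-multiplication scheme, the uniform transition matrix, and the two-state model are placeholders, not the paper's decoder.

    ```python
    # Sketch: per-frame speaker-verification scores (in [0, 1]) scale the
    # acoustic log-likelihoods in Viterbi decoding, so masker-dominated
    # frames contribute less to the best-path decision.
    import numpy as np

    def weighted_viterbi(log_obs, log_trans, frame_weights):
        """log_obs: (T, S) acoustic log-likelihoods; frame_weights: (T,) reliabilities."""
        T, S = log_obs.shape
        delta = frame_weights[0] * log_obs[0]        # weighted initial scores
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + log_trans        # (S_prev, S_next) path scores
            back[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + frame_weights[t] * log_obs[t]
        path = [int(delta.argmax())]                 # backtrack the best path
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    log_obs = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]))
    log_trans = np.log(np.full((2, 2), 0.5))         # uniform transitions
    weights = np.array([1.0, 0.1, 1.0])              # middle frame deemed unreliable
    print(weighted_viterbi(log_obs, log_trans, weights))  # → [0, 1, 0]
    ```

    With uniform transitions the decoded path simply follows the per-frame weighted scores; with a real transition matrix, down-weighting an unreliable frame lets the language/transition model dominate that frame's decision instead.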

    Who shall I say is calling? Validation of a caller recognition procedure in Bornean flanged male orangutan (Pongo pygmaeus wurmbii) long calls

    Acoustic individual discrimination has been demonstrated for a wide range of animal taxa. However, there has been far less scientific effort to demonstrate the effectiveness of automatic individual identification, which could greatly facilitate research, especially when data are collected via an acoustic localization system (ALS). In this study, we examine the accuracy of acoustic caller recognition in long calls (LCs) emitted by Bornean male orangutans (Pongo pygmaeus wurmbii) using two datasets: the first consists of high-quality recordings made during individual focal follows (N = 224 LCs by 14 males), and the second consists of LC recordings with variable microphone-caller distances obtained via the ALS (N = 123 LCs by 10 males). The LC is a long-distance vocalization; we therefore expect even the low-quality test set to yield caller recognition results significantly better than chance. Automatic individual identification was accomplished using software originally developed for human speaker recognition (the MSR Identity Toolbox). We obtained a 93.3% correct identification rate with the high-quality recordings and 72.23% with the ALS recordings at variable microphone-caller distances (20–420 m). These results show that automatic individual identification is possible, even though accuracy declines relative to the high-quality recordings owing to severe signal degradation with increasing distance (e.g. sound attenuation, environmental noise contamination, and echo interference). We therefore suggest that acoustic individual identification with speaker recognition software can be a valuable tool for data obtained through an ALS, thereby facilitating field research on vocal communication.
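    Closed-set caller identification of this kind follows a common pattern in GMM-based speaker-recognition toolkits: enroll one model per caller, then assign a test call to the highest-scoring model. The sketch below assumes that pattern only; the caller names are invented, feature extraction (e.g. MFCCs) is omitted, and random vectors stand in for call features.

    ```python
    # Sketch: enroll a GMM per caller, then identify a test call as the
    # caller whose model gives the highest average log-likelihood.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    # Hypothetical callers with toy, well-separated feature distributions.
    callers = {name: rng.normal(loc, 1.0, size=(300, 5))
               for name, loc in [("male_A", -2.0), ("male_B", 0.0), ("male_C", 2.0)]}
    models = {name: GaussianMixture(n_components=2, random_state=0).fit(feats)
              for name, feats in callers.items()}

    def identify(call_feats):
        """Closed-set decision: argmax over enrolled models' average log-likelihood."""
        scores = {name: model.score(call_feats) for name, model in models.items()}
        return max(scores, key=scores.get)

    test_call = rng.normal(2.0, 1.0, size=(50, 5))   # a call resembling male_C
    print(identify(test_call))
    ```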

    CRSS systems for 2012 NIST speaker recognition evaluation

    This paper describes the systems developed by the Center fo